In this analysis report I would like to find some patterns or characteristics that make some players the best.
After analysing the data I will use data analysis and statistics techniques using the libraries Pandas, Numpy and Matplotlib to manage datasets, make vectorized calculations and create visualisations to help understand the results.
In [1]:
# import libraries
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%pylab inline
from bokeh.io import output_notebook, show
from bkcharts import Donut
output_notebook()
In [2]:
# declare helper functions and load dataset
def load_files(data_path, filenames):
"""Load data files
Loads every csv file from the data directory.
Args:
data_path(string): Path to files
filenames(list): List of files to load from the data_path
Returns:
dict: A dictionary of table names as keys and DataFrame objects as values
"""
dfs = {}
for f in filenames:
parts = f.split('.')
if parts[1] != 'csv':
continue
dfname = parts[0]
absfilename = data_path + '/' + f
dfs[dfname] = pd.read_csv(absfilename)
return dfs
def describe_df(df):
"""Describe a DataFrame
Prints DataFrame information, the result of describe() method, and a blank line underneath.
Args:
df(DataFrame): The DataFrame to be described
"""
print(df)
display(dataframes[df].describe())
print('')
def describe_dataframes(dataframes):
"""Describe a list of DataFrame objects
Args:
dataframes(list): A list of DataFrame objects
"""
for df in dataframes:
describe_df(df)
def display_1d_array(array, col_name='Data'):
"""Display 1D Numpy array nicely
Args:
array (np.array): 1D Numpy array
col_name (string, optional): Name for column to display
"""
display(pd.DataFrame(array, columns=[col_name], index=[i+1 for i in range(len(array))]))
In [3]:
# declare visualization functions
def distribution(data, figsize=(7, 5), color='c', xlabel='X', ylabel='Y', title=None):
"""Draw a Seaborn Distribution plot
Draws a distribution plot setting proper size and title
Args:
data(DataFrame): Data to be used to draw
color(string, optional): Single character that represents a colour setting. Defaults to c
xlabel(string, optional): Label for X axis. Defaults to X
ylabel(string, optional): Label for Y axis. Defaults to Y
title(string, optional): Title to be displayed on top of the plot. Defaults to None
.. _Seaborn Documentation:
https://seaborn.pydata.org/generated/seaborn.distplot.html
"""
f, axes = plt.subplots(1, 1, figsize=(7, 5), sharex=True)
sns.despine(left=True)
sns.distplot(data, color=color, kde_kws={"shade": True})
axes.set(xlabel=xlabel, ylabel=ylabel)
plt.setp(axes, yticks=[])
plt.tight_layout()
if title is not None:
f.suptitle(title)
def plot_correlation(x, y, xlabel, ylabel, legend=None, title=None):
"""Draw a Seaborn Regplot plot
Plot data and a linear regression model fit.
Args:
x(array): X data in Numpy array format
y(array): Y data in Numpy array format
xlabel(string): Label for X axis
ylabel(string): Label for Y axis
legend(string, optional): Title to be displayed inside the plot. Defaults to None
title(string, optional): Title to be displayed on top of the plot. Defaults to None
.. _Seaborn Documentation:
https://seaborn.pydata.org/generated/seaborn.regplot.html
"""
if title is not None:
fig = plt.figure()
fig.suptitle(title)
ax = sns.regplot(x="x", y="y", data=pd.DataFrame({'x': x, 'y': y}), label=legend, x_jitter=.2)
ax.set(xlabel=xlabel, ylabel=ylabel)
if legend is not None:
ax.legend(loc="best")
def plot_correlation2(data, x_var, y_var, title=None):
"""Draw a Seaborn Joinplot plot
Draw a plot of two variables with bivariate and univariate graphs.
Args:
data(DataFrame): Data to draw the plot
x_var(string): Name of variable for X axis
y_var(string): Name of variable for Y axis
title(string, optional): Title to be displayed on top of the plot. Defaults to None
.. _Seaborn Documentation:
http://seaborn.pydata.org/generated/seaborn.jointplot.html
"""
sns.set(style="darkgrid", color_codes=True)
g = sns.jointplot(x_var, y_var, data=data, kind="reg",
xlim=(0, data[x_var].max()), ylim=(0, data[y_var].max()), color="r", size=7)
if title is not None:
g.fig.suptitle(title)
In [4]:
# load data
data_path = 'data/baseballdatabank-2017.1/core'
filenames = os.listdir(data_path)
dataframes = load_files(data_path, filenames)
In [5]:
dataframes['Master'].head()
Out[5]:
In [6]:
rows_with_missing_values = len(dataframes['Master'][dataframes['Master'].isnull().any(axis=1)])
total = len(dataframes['Master'])
def bokeh_pie_chart(values, labels):
"""Pie chart
Draws a Bokeh pie chart
Args:
values(list): List of Numpy arrays, one per type of data to show
labels(list): List of labels matching amount of arrays in values argument
"""
data = pd.Series(values, index=labels)
pie_chart = Donut(data)
show(pie_chart)
bokeh_pie_chart([rows_with_missing_values, total-rows_with_missing_values], ['Rows with missing values', 'Rows without missing values'])
As you can see more than 50% of entries, only from Master table (players), have missing values. I could not find any relevant missing values directly on tables that could directly affect my calculations. However I had to deal with missing relationships between players and schools tables as you will see underneath.
To answer any question about best players first I will have define what a best player is and how I will score it.
Best Player: Player who participated on most winning matches through time being part of one or more teams.
Player/Team Score: Ratio from the difference between Wins and Losses over amount of games played, assigned to every team.
In [7]:
# Wins calculation
team_wins_by_year = dataframes['Teams'].groupby(['teamID', 'yearID'], as_index=False)['teamID', 'W'].sum()
team_wins_by_year.sort_values('W', ascending=False).describe()
display(team_wins_by_year.describe())
As you can see underneath, most of teams won between 50 to 100 games in the time window contained in this dataset. I took a 150 teams sample to show the tendency.
In [8]:
team_wins_by_year_sample = team_wins_by_year.sample(n=150)
fig = plt.figure(figsize=(16,10))
ax = sns.stripplot(x="teamID", y="W", data=team_wins_by_year_sample, size=10)
for item in ax.get_xticklabels():
item.set_rotation(45)
ax.set(xlabel='Teams', ylabel='Wins')
fig.suptitle('Teams by Yearly Wins (based on sample of 150)', fontsize=16);
In [9]:
# Team Wins calculation
team_wins_total = team_wins_by_year.groupby('teamID', as_index=False)['W'].sum()
display(team_wins_total.describe())
Some statistical numbers about Team Wins
In [10]:
# Losses calculation
team_losses_by_year = dataframes['Teams'].groupby(['teamID', 'yearID'], as_index=False)['teamID', 'L'].sum()
team_losses_by_year.sort_values('L', ascending=False).describe()
display(team_losses_by_year.describe())
team_losses_by_year_sample = team_losses_by_year.sample(n=150)
Now, showing below the amount of games that most of teams lost, the average is slightly higher (from 60 to 100).
Does it make sense? Well, not everyone can win...
In [11]:
fig = plt.figure(figsize=(16,10))
ax = sns.stripplot(x="teamID", y="L", data=team_losses_by_year_sample, size=8)
for item in ax.get_xticklabels():
item.set_rotation(45)
ax.set(xlabel='Teams', ylabel='Losses')
fig.suptitle('Teams by Yearly Losses (based on sample of 150)', fontsize=16);
In [12]:
# Team losses calculation
team_losses_total = team_losses_by_year.groupby('teamID', as_index=False)['L'].sum()
display(team_losses_total.describe())
And some more statistical numbers about Team Losses
In [13]:
def calc_score(x):
"""Score calculation
Calculates score based on Wins vs. Losses over amount of games
Args:
x(DataFrame): One column DataFrame
Returns:
float: Formula result
"""
return (x['W'] - x['L']) / (x['W'] + x['L'])
Statistical information and first entries of Team Results
In [14]:
# Team results calculation
teams_results = team_wins_total.merge(team_losses_total, on='teamID', how='inner')
tscores = teams_results.apply(calc_score, axis=1)
teams_results['score'] = tscores.values
display(tscores.describe())
display(teams_results.head())
Again, now displayed as a distribution, teams seem to be draw, but slightly negatively skewed. Which has the same meaning as the previous plot: there are more teams losing than winning.
In [15]:
distribution(tscores, xlabel='Score', ylabel='Teams', title='Team Score Distribution')
In [16]:
# Best teams calculation
best_teams = teams_results[teams_results['score'] > teams_results.quantile(q=0.8)['score']].sort_values('score', ascending=False)
display(best_teams.describe())
Now we have only 30 teams. If you check the previous figure the minimum score_norm was -1. Now it is 0.064, and max of 0.664.
In [17]:
distribution(best_teams['score'].values, color='m', xlabel='Score', ylabel='Teams', title='Best Team Score Distribution')
In [18]:
# Player results calculation
players_results = dataframes['Appearances'].merge(teams_results, on='teamID', how='inner').groupby(['playerID'], as_index=False)['W', 'L', 'score'].sum()
pscores = players_results['score'].values
display(players_results.describe())
Now looking only at the best players, the distribution of score is perfectly symmetric.
In [19]:
distribution(pscores, xlabel='Score', ylabel='Players', title='Best Players Score Distribution')
Now looking at the best 2% of players the shape of this positively skewed distribution represents only a small portion on the right side of the previous plot.
In [20]:
# Players participating on games calculation
players_appearences = players_results.merge(dataframes['Appearances'].groupby('playerID', as_index=False)['teamID'].count(), on='playerID', how='left')
best_players_appearences = players_appearences[players_appearences['score'] > players_appearences.quantile(q=0.98)['score']]
distribution(best_players_appearences['score'].values, color='r', xlabel='Score', ylabel='Players', title='2% Best Players Distribution')
In [21]:
def set_to_str(x):
"""Set to string helper function
Converts a set to string
Args:
x(set): Set to be represented as a string
Returns:
string: String representation of set
"""
x = ','.join([str(c) for c in x])
return x
def set_to_count(x):
"""Set to count helper function
Converts a set into the count of its elements
Args:
x(set): Set to be counted
Returns:
int: Number of elements contained in the set
"""
if str(x) == '{nan}':
return 0
return len(x)
schools_of_winners = best_players_appearences.merge(dataframes['CollegePlaying'], on='playerID', how='left')
best_players_appearences_with_schools = best_players_appearences.merge(schools_of_winners, on='playerID', how='left').groupby('playerID', as_index=False).agg({'schoolID': lambda x: set(x)})
best_players_appearences_with_schools['school_count'] = best_players_appearences_with_schools['schoolID'].apply(set_to_count)
display(best_players_appearences_with_schools.merge(dataframes['Master'], on='playerID', how='left')[['playerID', 'nameGiven', 'schoolID', 'school_count']].head())
Almost none of best players report schools
In [22]:
best_players_with_school_and_score = best_players_appearences.merge(best_players_appearences_with_schools, on='playerID', how='left')
top10_schools = best_players_with_school_and_score[best_players_with_school_and_score['school_count'] > 0].sort_values('score', ascending=False).apply({'schoolID': lambda x: ','.join(x)}).iloc[0:10]
display_1d_array(top10_schools.values, col_name='School')
List ot top 10 schools
Apparently participating in more winning games doesn't have a strict correlation with highest salaries, but they are still in a massive salaries list. To calculate the salaries, I took the maximum salary for each player/team.
As you can see underneath, highest salaries didn't go to players of teams with more successful games.
In [23]:
def calculate_mean_salary(x):
count = len(x)
return ((1/count) * np.sum(x)/1000000)
best_players_salaries = dataframes['Salaries'].merge(best_players_with_school_and_score, on='playerID', how='right')[['playerID', 'salary', 'yearID', 'score']].dropna().sort_values(['salary'], ascending=False)
bpsalaries = best_players_salaries['salary'].values
#bscore_norm = MinMaxScaler().fit_transform(bpsalaries.astype(np.float64).reshape(-1, 1)).flatten()
#best_players_salaries['salary_norm'] = bpscore_norm
highest_salaries_players_with_score = best_players_salaries.groupby('playerID', as_index=False).agg({'salary': calculate_mean_salary}).merge(players_results, on='playerID', how='left')
display(highest_salaries_players_with_score[['salary', 'score']].describe())
Statistical information of best players scores vs. salaries (expressed in millions of USD)
In [24]:
max_salary_idx = highest_salaries_players_with_score[['salary']].idxmax()
max_score_idx = highest_salaries_players_with_score[['score']].idxmax()
print('Highest salary (millions)')
display(highest_salaries_players_with_score.loc[max_salary_idx].merge(dataframes['Master'], on='playerID', how='left')[['playerID', 'nameGiven', 'salary', 'score']])
print('Highest score')
display(highest_salaries_players_with_score.loc[max_score_idx].merge(dataframes['Master'], on='playerID', how='left')[['playerID', 'nameGiven', 'salary', 'score']])
Here is the difference between the player getting the highest salary and the one participating the more winning games.
In [25]:
plot_correlation2(data=highest_salaries_players_with_score, x_var="score", y_var="salary", title='Correlation between Salary and Score')
Do best players get higher salaries? It is pretty difficult to see using only the variables on this report. It might be interesting to include more variables like manager biases or getting a better normalization of salaries (money does not have the same value now and 100 years ago).
In [26]:
managers = dataframes['Managers']
mscores = managers.apply(calc_score, axis=1)
managers['score'] = mscores
managers = managers.groupby('playerID', as_index=False).sum()[['playerID', 'score', 'rank']]
best_managers = managers.sort_values('score', ascending=False)
distribution(mscores, color='g', xlabel='Score', ylabel='Managers', title='Manager Score Distribution')
It seems that best teams were managed by the best managers (or managers are considered the best because they won more games). This shows me how important leadership is to succeed.
In [27]:
manager_rank_score_corr = best_managers.sort_values('score', ascending=False).iloc[0:100][['playerID', 'score', 'rank']]
plot_correlation(x=manager_rank_score_corr['rank'], y=manager_rank_score_corr['score'], xlabel='Rank', ylabel='Score', title='Correlation between Manager Score and Ranking')
I think there are many variables that I could include to see whether data could tell us anything more interesting, perhaps using Pitching, Fielding and Batting tables. Even curious patterns like birth or death places, or even doing the opposite job, like finding patterns correlating with losing teams.
The following and last plot shows where players die more often. Please don't play baseball in Philadelphia!!!
In [28]:
player_death_city = dataframes['Master'][['playerID', 'deathCity']].dropna().merge(players_results, on='playerID', how='left')
worst_players = player_death_city.sort_values('score')[['playerID', 'deathCity', 'score']].dropna()
top10_worst = worst_players[worst_players['score'] < worst_players.quantile(q=0.1)['score']].sort_values('score').groupby('deathCity', as_index=False)['playerID'].count().sort_values('playerID', ascending=False).iloc[0:10]
#display(top10_worst.iloc[0:10])
f, ax = plt.subplots(figsize=(10, 7))
sns.barplot(x="deathCity", y="playerID", data=top10_worst)
#sns.stripplot(x="deathCity", y="playerID", data=top10_worst)
ax.set(xlabel='Death City', ylabel='Amount of Dead Players')
f.suptitle('Just for fun: Most deadly cities to play baseball');
I've found that the best-reported player was Henry Louis (not reporting school).
To answer the questions I had at the beginning, I've found that Columbia is the school with more best players. However, there were many missing relationships between players and the schools they went. This means that this school is only the best after ignoring best players who didn't report schools.
About highest salaries, Alexander Emmanuel seems to be the luckiest one with more than USD 18M. On the other hand, the one with the best-reported performance was Derek Sanderson.
Now, the most important fact, in my opinion, is the last question: Is there a correlation between best players and managers? Well, there is. I've found that leadership seems to be strongly involved in success: Players and teams were better with best-ranked managers.
Sean Lahman, Award Winning Database Journalist and Author.
Website: http://www.seanlahman.com/baseball-archive/statistics/